PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

introduction.tex (6294B)


The rapid increase in the amount of published information brings forward the problem of how to handle large amounts of data.
To this end, \emph{information extraction} aims at discovering the underlying semantic structure of texts.
As such, it is considered to be a part of natural language understanding.
It is the link from unstructured text to structured data.
Following Section~\ref{sec:context:knowledge base}, we will use knowledge bases as a formalization of structured data.
However, to encompass the notion of information more appropriately, the concept of knowledge base needs to be taken in a broad sense.
The strict definition of knowledge underlying most knowledge bases only includes general facts and does not encompass things such as ``Seneca is contemptuous even of the best garum.''
Yet this sentence conveys a piece of information that needs to be considered by information extraction systems.
We will therefore consider text-specific facts such as ``Seneca \textsl{dislikes} garum'' to be facts belonging in a knowledge base.

In this thesis, we focus on relation extraction, a subtask of information extraction.
\begin{marginparagraph}
	In contrast to relation extraction, when filling a template about an entity, the template has a fixed number of fields to be filled.
	In the language of Section~\ref{sec:context:relation algebra}, this means that all relations are left-total: \(r\relationComposition\breve{r}=r\relationComposition\breve{r}\relationOr\relationIdentity\).
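	As a sketch, assuming relations are sets of ordered pairs and composition is read left to right (conventions not restated in this chapter), \(r\relationComposition\breve{r}\) relates \(x\) to \(z\) whenever some \(y\) satisfies \((x, y)\in r\) and \((z, y)\in r\), so containing the identity amounts to:
	\[\forall x\; \exists y\colon (x, y)\in r,\]
	that is, every entity has at least one value for each field.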
\end{marginparagraph}
Precursors of relation extraction were the template filling tasks.
In these tasks, objects corresponding to a given class---usually a specific kind of event---must be extracted from a text, and a template must be filled with information about each object.
This approach was pioneered by \textcitex{syntactic_formatting} but only started gathering interest with the message understanding conferences (\textsc{muc}) supported by \textsc{darpa}.%
\sidenote{The Defense Advanced Research Projects Agency, a research agency of the \textsc{usa} Department of Defense.}
The template filling task was formalized and evaluated in a systematic way starting with \textsc{muc-2}%
\sidenote{At the time, the conference was known as \textsc{muck-ii}.}
in 1989.
But it was not until 1997 that \textsc{muc-7} formalized the modern relation extraction task.
The \textsc{muc}s were succeeded by the automatic content extraction (\textsc{ace}) program convened by the \textsc{nist}%
\sidenote{The National Institute of Standards and Technology, an agency of the \textsc{usa} Department of Commerce.}
starting in 1999.

The main information extraction task is known as \emph{knowledge base population} and consists in generating knowledge base facts from a set of documents.
This task can be broken down into several steps, as illustrated by Figure~\ref{fig:relation extraction:ie steps}:
\begin{description}
	\item[Entity chunking] seeks to locate entities in text.
		A similar task is named entity recognition (\textsc{ner}), which not only locates the entities but also assigns them a type such as ``organization,'' ``person,'' ``location,'' etc.
		The relation extraction datasets we consider in subsequent chapters do not include this entity-type information.
		However, \textsc{ner} was more prevalent in relation extraction works during the 2000s.

	\item[Entity linking] assigns a knowledge base entity identifier to a tagged entity in a sentence.
		This disambiguates ``Paris, France'' \wdent{90} from ``Paris, son of Priam, king of Troy'' \wdent{167646} and ``Paris, genus of the true lover's knot plant'' \wdent{162121}.
		Following the above discussion on our broad sense of knowledge, an entity may not necessarily appear in an existing knowledge base, in which case the entity identifier can be taken to be the entity's surface form.

	\item[Relation extraction] assigns a knowledge base relation identifier to an ordered pair of tagged entities in a sentence.
		Paris is not only the capital of France, it is also located in France.
		However, the sentence of Figure~\ref{fig:relation extraction:ie steps} does not convey the idea of location but that of capital; predicting ``\textsl{located in country}'' \wdrel{17} would thus be incorrect there.
\end{description}
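As a minimal worked example, the three steps can be read as progressively filling in a (head, relation, tail) triple; this triple notation and the ``?'' placeholder are assumptions made here for illustration only.
Using the entities and identifiers of Figure~\ref{fig:relation extraction:ie steps}, each step would yield:
\begin{center}
	\begin{tabular}{l l}
		Entity chunking & (``Paris'', ?, ``France'') \\
		Entity linking & (\wdent{90}, ?, \wdent{142}) \\
		Relation extraction & (\wdent{90}, \wdrel{1376}, \wdent{142}) \\
	\end{tabular}
\end{center}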
\begin{marginfigure}[-60mm]
	\centering
	\input{mainmatter/relation extraction/ie steps.tex}
	\scaption[The three standard tasks for knowledge base population.]{
		The three standard tasks for knowledge base population.
		First, entity chunking locates the entities in the sentence, here ``Paris'' and ``France.''
		Second, entity linking maps each entity to a knowledge base identifier, here \wdent{90} and \wdent{142}.
		Third, relation extraction finds the relation linking the two entities, here \wdrel{1376} (\textsl{capital of}).
	}
	\label{fig:relation extraction:ie steps}
\end{marginfigure}

Whereas Chapter~\ref{chap:context} introduces the main tools used in relation extraction systems, the present chapter focuses on the relation extraction task itself.
We formally define relation extraction in Section~\ref{sec:relation extraction:definition} and introduce its main variants encountered in the literature.
A fundamental problem of relation extraction models is how to obtain supervision.
Hand labeling a dataset is tedious and error-prone, so several alternative supervision techniques have been considered over the years; this is the focus of Section~\ref{sec:relation extraction:supervision}.
We then introduce noteworthy supervised approaches---including weakly and semi-supervised ones---in Sections~\ref{sec:relation extraction:sentential} and~\ref{sec:relation extraction:aggregate}.
As we will see in Section~\ref{sec:relation extraction:definition}, the task can be tackled at the sentence level or at a higher level.
Section~\ref{sec:relation extraction:sentential} introduces sentence-level models, while Section~\ref{sec:relation extraction:aggregate} introduces higher-level models.
Lastly, we delve into the main subject of this thesis, unsupervised relation extraction, in Section~\ref{sec:relation extraction:unsupervised}.
Each of these sections is generally ordered following historical development, with older methods appearing first and the current state of the art appearing last.